The Creation of Large-Scale Annotated Corpora of Minority Languages using UniParser and the EANC platform

Authors

Timofey Arkhangelskiy

Oleg Belyaev

Arseniy Vydrin

Abstract

This paper describes two tools for creating morphologically annotated linguistic corpora: UniParser and the EANC platform. The EANC platform is the database and search framework originally developed for the Eastern Armenian National Corpus (www.eanc.net) and later adopted for other languages. UniParser is an automated morphological analysis tool developed specifically for creating corpora of languages with relatively small numbers of native speakers, for which developing parsers from scratch is not feasible. It has been designed for use with the EANC platform and generates XML output in the EANC format. UniParser and the EANC platform have already been used to create corpora of several languages: Albanian, Kalmyk, Lezgian, and Ossetic, of which the Ossetic corpus is the largest (5 million tokens, 10 million planned for 2013). They are currently being employed in the construction of corpora of Buryat and Modern Greek. The paper describes the general architecture of the EANC platform and UniParser, using the Ossetic corpus as an example of the advantages and disadvantages of the approach.

Similar resources

A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources

We take a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons. The overall research question is how to exploit language resources and properties to facilitate and automate the creation of morphologically annotated corpora for new languages. This portability issue is especially relevant to minority languag...

WebBANC: Building Semantically-Rich Annotated Corpora from Web User Annotations of Minority Languages

Annotated corpora are sets of structured text used to enable Natural Language Processing (NLP) tasks. Annotations may include tagged parts-of-speech, semantic concepts assigned to phrases, or semantic relationships between these concepts in text. Building annotated corpora is labor-intensive and presents a major obstacle to advancing machine translators, named entity recognizers (NER), part-ofs...

Developing Morphologically Annotated Corpora for Minority Languages of Russia

Despite recent progress in developing annotated corpora for minority languages of Russia, still only about a dozen out of about 100 have comprehensive corpora, and even less have computational tools such as machine translation systems or speech recognition modules. However, given that many of them have resources such as dictionaries and grammars, the situation can be improved at relatively low ...

Analysis of Language Legislation of All 85 Russian Federation’s Subjects (Regions)

The analysis of the language legislation of all 85 subjects of the Russian Federation shows complete heterogeneity and diversity. Common legal guidelines in Federal law do not exist, because Federal legislation is obsolete and is largely whitespace and conflict. The subjects of the Russian Federation, on whose territory different ethnic groups, both large and indigenous, historically live, solv...

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

Publication date: 2012
